ABSTRACT

FireDB (http://firedb.bioinfo.cnio.es) is a curated 
inventory of catalytic and biologically relevant 
small ligand-binding residues culled from the 
protein structures in the Protein Data Bank. Here
we present the important new additions since the 
publication of FireDB in 2007. The database now 
contains an extensive list of manually curated bio- 
logically relevant compounds. Biologically relevant 
compounds are informative because of their role in 
protein function, but they are only a small fraction of 
the entire ligand set. For the remaining ligands,
FireDB provides cross-references to the annotations
from publicly available biological, chemical
and pharmacological compound databases. FireDB 
now has external references for 95% of contacting 
small ligands, making FireDB a more complete 
database and providing the scientific community 
with easy access to the pharmacological annota- 
tions of PDB ligands. In addition to the manual 
curation of ligands, FireDB also provides insights 
into the biological relevance of individual binding 
sites. Here, biological relevance is calculated from 
the multiple sequence alignments of related binding 
sites that are generated from all-against-all com- 
parison of each FireDB binding site. The database 
can be accessed by RESTful web services and is 
available for download via MySQL. 

INTRODUCTION 

The growth in protein sequence and structural databases 
is accelerating thanks to genome sequencing projects (1) 
and structural genomics initiatives (2). This rapid growth
of the primary databases is generating an enormous
quantity of potentially interesting data. Secondary data- 
bases that can analyse and process this information and 
present it in a usable form are necessary to allow us to 
make use of this wealth of new biological data. 

Much of the untapped functional information in the 
main repository for protein 3D structures, the Protein 
Data Bank [PDB, (3)], can be found at the residue level 
in the form of the amino acid residues involved in ligand 
binding and implicated in catalysis. Functional informa- 
tion at the residue-level, such as the amino acid residues 
implicated in protein-protein interactions and in molecu- 
lar function, can be of crucial importance in the elucida- 
tion of protein function. Pinpointing catalytic residues and 
ligand-binding sites by computational means provides
vital clues for the design of targeted biochemical experi- 
ments, and could play a role in drug design and screening. 

The PDB database is the largest source of these func- 
tionally important residues. FireDB (4), a database of 
ligand-binding and catalytic residues culled from the
protein structures deposited in the PDB, was developed 
specifically to make use of the PDB ligand-binding data. 

FireDB is more than a simple repository of PDB
residue-ligand contacts; it also attempts to bring some
order to the protein-ligand interactions. Many ligands in
the deposited structures in the PDB do not have any strict
biological meaning, and FireDB puts a value on the
biological importance of each protein-ligand interaction. The
separation of biological and non-biological ligands in the
PDB is a major issue when defining what a binding site is. 
This definition is especially difficult for small organic or 
inorganic molecules and ions that can be biologically 
important in some cases, while in others may simply be 
crystallized along with the protein structure as part of the 
buffer or solvent. 

Many ligand databases attempt to divide ligands into
biologically relevant and non-biologically relevant classes,
and different approaches have been used to deal with this
problem, generally based on the nature of the ligand.
LigASite (5) uses the size of the ligand and the
characteristics of the binding site (>10 heavy atoms and
>70 inter-atomic contacts with the protein) to filter out
uninteresting binding sites. However, this strict approach
leaves out ions and small molecules that are known to be
important for the structure and function of proteins. The
most recent version of LigASite contains annotation for
391 non-redundant data entries and a total of 1194 unique
ligands. Binding MOAD (6) uses pre-established criteria
(manually curated lists) but does not take metals into
account. The latest version contains 21 109 proteins in
contact with 10 156 different ligands. The recently
introduced BioLip (7) has a more sophisticated procedure
that combines automated and manual steps in order to avoid
the loss of information. It is updated weekly and the
August 2013 version contained 56 763 proteins and 11 185
unique ligands. No other database attempts to annotate
biological relevance at the level of individual binding
sites.

Computer predictions of functional residues have now 
become an integral part of the process of protein function 
determination. Many functional residue prediction 
methods have been developed in recent years (8-11) and 
the most effective methods involve some form of homologous
transfer of ligand-binding data. FireDB has a companion
web server, firestar (12,13), that bases its ligand-binding
and catalytic residue predictions on the binding
sites in FireDB.

Here we present the new developments in FireDB.
A number of new features have been incorporated into
the database, substantially extending its coverage and
improving both the quality of the FireDB
ligand annotations and the usability of FireDB data.



CONTENTS 

FireDB brings together ligands crystallized in PDB
structures, the residues in contact with those ligands, and the
catalytic sites annotated by hand in the Catalytic Site
Atlas (14). FireDB also incorporates detailed information
on each ligand along with two tools, SQUARE (15) and
firestar. FireDB clusters are cross-linked with UniProt
accession codes (16), EC enzyme numbers (17) via MSD
(18) and GO terms from GOA-PDB (19). The general
flowchart is available in the online documentation.

The ligands in FireDB are extracted from the mmCIF
(20) data file. Since FireDB is oriented towards small
molecule ligands, interactions with proteins, DNA
and RNA are excluded, as are large ligands
such as photosystems where the number of ligand atoms is
at least two-thirds of the number of protein atoms.
In addition, many solvent molecules are filtered out at an
early stage. The remaining ligands are tagged as metal or
non-metal depending on the nature of the ligand. Ligands
are cross-linked with the publicly available chemical
databases as detailed below. FireDB defines residues as
being in contact with a ligand when an atom-atom distance
is less than the sum of the van der Waals radii plus a
0.5 Å cut-off.
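As a rough illustration of the contact definition above, the following sketch tests whether two atoms are in contact. It assumes that the cut-off is the sum of the two van der Waals radii plus 0.5 Å; the radii table (Bondi values) and all function and variable names are assumptions made for this example, since the radii used by FireDB are not stated here.

```python
import math

# Approximate van der Waals radii in angstroms (Bondi values).
# Assumption: FireDB's own radii table is not given in the text.
VDW_RADII = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80, "P": 1.80}

# Tolerance added to the sum of the radii, in angstroms.
TOLERANCE = 0.5


def in_contact(elem_a, xyz_a, elem_b, xyz_b, tolerance=TOLERANCE):
    """Return True if two atoms are closer than the sum of their radii plus tolerance."""
    cutoff = VDW_RADII[elem_a] + VDW_RADII[elem_b] + tolerance
    return math.dist(xyz_a, xyz_b) < cutoff


def residue_contacts_ligand(residue_atoms, ligand_atoms):
    """A residue is in contact if any of its atoms contacts any ligand atom.

    Both arguments are lists of (element, (x, y, z)) tuples.
    """
    return any(
        in_contact(ea, xa, eb, xb)
        for ea, xa in residue_atoms
        for eb, xb in ligand_atoms
    )
```

With these radii a carbon-oxygen pair has a cut-off of 1.70 + 1.52 + 0.5 = 3.72 Å, so a pair 3.5 Å apart counts as a contact and a pair 4.0 Å apart does not.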



In order to reduce the redundancy inherent in the PDB, 
ligands, binding residues and CSA catalytic residues are
associated with FireDB master sequences. Master sequences
are consensus sequences generated by clustering all 
PDB chains at 97% sequence identity using CD-HIT 
(21), and building multiple sequence alignments with 
MUSCLE (22). 

The database schema has been updated in order to 
integrate the new features in FireDB and to discard infor- 
mation considered not useful. A complete graphical 
schema of the database structure is now available in the 
online documentation; full descriptions of tables are also
provided. 

Collapsing binding sites 

Multiple binding sites in the same FireDB cluster are 
collapsed together into master sequence binding sites 
(MSS) if they overlap over at least 60% of the
binding residues. Overlapping binding sites are clustered
even if the ligands in the sites are different, although
overlapping metal and non-metal-binding sites are 
always collapsed independently. Catalytic residues from 
the CSA in FireDB clusters are also collapsed into MSS 
in the same way as the ligand-binding sites. This means 
that there are three types of MSS, metal, non-metal and 
catalytic. 
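The collapsing step can be sketched as a simple greedy merge. This is only an illustration under stated assumptions: the text does not say how the 60% overlap is normalised or in what order sites are compared, so here the overlap is measured against the smaller site and sites are processed in input order. All names are hypothetical.

```python
def overlap_fraction(site_a, site_b):
    """Shared residues as a fraction of the smaller site (an assumed denominator)."""
    smaller = min(len(site_a), len(site_b))
    return len(site_a & site_b) / smaller if smaller else 0.0


def collapse_sites(sites, threshold=0.6):
    """Greedily merge binding sites of the same kind that overlap enough.

    `sites` is a list of (kind, residues) pairs, where kind is e.g.
    "metal", "non-metal" or "catalytic" and residues is an iterable of
    master-sequence positions. Sites of different kinds are never merged.
    """
    collapsed = []
    for kind, residues in sites:
        residues = set(residues)
        for mss in collapsed:
            same_kind = mss["kind"] == kind
            if same_kind and overlap_fraction(mss["residues"], residues) >= threshold:
                mss["members"].append(residues)
                mss["residues"] |= residues
                break
        else:
            # No existing MSS overlaps enough: start a new one.
            collapsed.append(
                {"kind": kind, "members": [residues], "residues": set(residues)}
            )
    return collapsed


example = [
    ("non-metal", {10, 11, 12, 13, 14}),
    ("non-metal", {11, 12, 13, 14, 20}),  # shares 4 of 5 residues with the first
    ("metal", {11, 12, 13}),              # overlaps, but kept separate as a metal site
    ("non-metal", {50, 51}),
]
mss_list = collapse_sites(example)
```

In this example the first two non-metal sites merge into one MSS, while the metal site and the distant non-metal site each form their own.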

The MSS is composed of the residues from all sites that 
make up the MSS. Residues in the MSS are given an 
occupancy score, calculated from the frequency with 
which each residue is in contact with a ligand in each of 
the separate sites that make up the MSS. 
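A minimal sketch of the per-residue occupancy score follows, assuming it is simply the fraction of constituent sites in which a residue contacts a ligand. The function name and input layout are hypothetical.

```python
from collections import Counter


def occupancy_scores(member_sites):
    """Fraction of constituent sites in which each residue contacts a ligand.

    `member_sites` is a list of residue-position collections, one per
    site collapsed into the MSS.
    """
    counts = Counter(res for site in member_sites for res in set(site))
    n_sites = len(member_sites)
    return {res: counts[res] / n_sites for res in counts}


scores = occupancy_scores([{10, 11, 12}, {11, 12}, {12, 13}])
```

Here residue 12 contacts a ligand in all three sites and scores 1.0, while residues 10 and 13 each appear in only one site and score one third.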

The reduction of multiple sites into a single MSS is a 
key step in the construction of the database and can 
provide much information on its own. First of all, the 
comparison between the constituent binding sites can 
shed light on ligand flexibility [especially for co-enzymes
such as ADP/ATP, (23)] and second it allows comparison
of the different residues involved in binding different
ligands in the same binding site.

Functionality 

Users of FireDB can retrieve the detailed annotations 
for each MSS via PDB code, UniProt accession code or 
associated keywords. Detailed ligand information can be
retrieved via the mmCIF three-letter code or keywords. 

FireDB in numbers 

FireDB has grown with the PDB. The first stable version 
of FireDB (July 2006) had 76 504 protein chains. The 
latest version (August 2013) has 224 691 chains, with 
binding site annotations for 141 199, and 16 661 annotated 
PDB ligands. In total there are 116 514 non-redundant 
ligand-binding MSS and 11 416 catalytic site MSS. The
most recent version of FireDB contains 26 287 master
sequences with at least one MSS. A comparison between the
September 2006 and August 2013 releases can be seen in 
Table 1. 

FireDB annotates binding sites and ligands for more
than a quarter of sequence space. FireDB master sequences
covered 6519 out of the 14 381 PfamA families in the
March 2013 release of Pfam and 4246 of these families were
annotated with ligands. The functional residue prediction
server firestar that is attached to FireDB can extend
predicted binding sites to a total of over 5400 PfamA families
(approximately 40%; unpublished data).

Table 1. The growth of FireDB, a comparison of the September 2006
and August 2013 releases
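The coverage figures quoted above can be checked with a few lines of arithmetic. The counts are taken from the text; the value for firestar is a lower bound, since the text says "over 5400", and the variable names are chosen for this example only.

```python
PFAM_A_FAMILIES = 14381    # PfamA families in the March 2013 Pfam release
COVERED = 6519             # families covered by FireDB master sequences
WITH_LIGANDS = 4246        # covered families annotated with ligands
FIRESTAR_EXTENDED = 5400   # lower bound: the text says "over 5400"


def percent(part, whole):
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


pct_covered = percent(COVERED, PFAM_A_FAMILIES)
pct_with_ligands = percent(WITH_LIGANDS, PFAM_A_FAMILIES)
pct_firestar = percent(FIRESTAR_EXTENDED, PFAM_A_FAMILIES)

print(f"covered: {pct_covered:.1f}%")
print(f"annotated with ligands: {pct_with_ligands:.1f}%")
print(f"reachable with firestar: at least {pct_firestar:.1f}%")
```

The ligand-annotated share comes out just under 30%, consistent with "more than a quarter", and the firestar figure is a little under 40%.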

NEW ADDITIONS AND IMPROVEMENTS IN FireDB

Biological activity of PDB Ligands 

FireDB categorizes all protein-small ligand interactions in 
the PDB. The PDB contains a diverse range of bound 
ligands, but many PDB ligands have little biological rele- 
vance. The structures deposited in the PDB often contain 
solvents and non-biological molecules that are part of the 
crystallization conditions. In addition there are many com- 
pounds that would not be found under strictly biological 
conditions, but that are nonetheless interesting, such as an- 
tagonists, inhibitors and other drugs. In order to deal with 
this range of functions all ligands in the FireDB repository 
are now classified in terms of their biological relevance 
and placed in one of three classes: 'COGNATE', 'NON- 
COGNATE' or 'AMBIGUOUS' based on manual 
curation and exhaustive literature searches. 

COGNATE 

The aim of FireDB is to annotate functionally important 
residues and for this reason the information obtained from
naturally occurring protein-ligand interactions is given
more weight in FireDB. COGNATE compounds are
those that are the natural biological ligands of the
protein they are in contact with. These ligands will be
metal ions, co-factors, substrates and/or products.

A great deal of effort has gone into expanding the avail- 
able manual annotation of FireDB ligands. Ligands have 
been annotated as COGNATE by the database curators 
based on extensive literature searches and on the
role of each ligand in the PDB structure it is crystallized
with. The August 2013 release of FireDB has a list of 655
natural biological compounds and, to our knowledge, this
is the largest list of its kind.

AMBIGUOUS 

Biologically relevant compounds are informative because
of their role in protein function, but they are a small
fraction of the entire compound set. In some cases
cognate compounds can also act as non-biological ligands.
Those compounds that can be biological ligands, but that
are also often found as part of the crystallization
conditions, are defined as AMBIGUOUS. For example, sucrose
(Figure 1) is present as a ligand in more than 190 entries in
the PDB but is often used as a buffer in crystallization
solutions. So far there are 54 compounds in the list of
AMBIGUOUS ligands.

NON-COGNATE 

All other compounds. FireDB provides cross-referencing
of PDB ligands to publicly available biological, chemical
and pharmacological compound databases for the
non-cognate ligands in the PDB. Complete searchable lists of
all these ligands are available on the website.

Extended annotations for NON-COGNATE ligands

The vast majority of the PDB compounds are classified as 
NON-COGNATE. As of August 2013, there were 14 319
compounds in contact with at least one chain in FireDB; 
13 610 of these were annotated as NON-COGNATE com- 
pounds. They often have little or no easily accessible 
annotation. 
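As a quick consistency check on the three ligand classes, the counts reported in the text can be added up. The sketch assumes that the 655 COGNATE and 54 AMBIGUOUS compounds are all among the 14 319 contacting compounds, which the text does not state explicitly; the variable names are illustrative.

```python
CONTACTING_COMPOUNDS = 14319   # compounds in contact with at least one chain
NON_COGNATE = 13610
COGNATE = 655
AMBIGUOUS = 54

# Compounds left over once the NON-COGNATE class is removed.
remainder = CONTACTING_COMPOUNDS - NON_COGNATE

# Share of contacting compounds classified as NON-COGNATE.
non_cognate_share = 100.0 * NON_COGNATE / CONTACTING_COMPOUNDS

print(remainder, COGNATE + AMBIGUOUS)
print(f"NON-COGNATE share: {non_cognate_share:.1f}%")
```

Under that assumption the remainder of 709 compounds matches the combined COGNATE and AMBIGUOUS lists, and NON-COGNATE compounds make up about 95% of the set.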

There are many public chemical and pharmacological 
databases that store a diverse range of data about 
chemical compounds. Unfortunately it is often difficult 
to cross-reference this valuable information with PDB 
ligands. The RCSB PDB has made efforts to map
compounds to some of these databases, but so far only direct
associations to DrugBank (24) are shown on the web page
and this only covers 33% of the compounds; for other
databases a search link is provided, making the collection
of information complicated.

We have developed a pipeline for the automatic cross 
matching of PDB ligands with information gathered from 
well-known publicly available compound databases. We 
selected the KEGG COMPOUND database (25) and 
MetaCyc (26) in order to cross-reference possible new 
COGNATE compounds. We selected ChEMBL (27), 
ChEBI (28), DrugBank and KEGG DRUG (25) because
they are known repositories of molecules with
pharmacological activity. Finally, we used PubChem (29,30)
because it is one of the most complete databases of
small-molecule information. PubChem also provides
biological activity annotation.

The pipeline was able to assign an external reference to
approximately 95% of PDB compounds with at least one
contact in the FireDB database. Small ligands in FireDB
now have external references to PubChem, KEGG
COMPOUND, MetaCyc, ChEMBL, ChEBI, DrugBank
or KEGG DRUG and FireDB also stores annotations
of biological activity by cross-linking with PubChem,
KEGG DRUG or DrugBank when available.

PubChem is the major contributor, followed by
ChEMBL and DrugBank. Additionally we retrieved
pharmacological annotations from PubChem, DrugBank
and KEGG DRUG, obtaining at least one pharmacological
description for 1300 compounds. For those compounds
for which we had little or no information, we
began a process of manual curation. So far over 300
non-cognate compounds have been manually curated and we
have added direct scientific references, and data about
activity and target organisms. An example of the new
ligand web pages can be seen in Figure 2.

Figure 1. Biological and non-biological binding of sucrose. (A) Sucrose-binding site of a bacterial levan fructotransferase (PDB: 4FFH). This site is
described in the associated paper as biological and sucrose is the substrate. FireDB annotations for the ligands in the collapsed MSS are shown in the
upper right panel. 'E' represents the number of evolutionarily related sites found in FireDB and the percentage shows the occupancy, calculated from
the number of chains in the MSS that bind ligands at this site (in this case 2 of 16 chains). (B) Sucrose-binding site of a bacterial tRNA-splicing ligase
(PDB: 4DWR). This site is not biological, and sucrose is cited as part of the crystallization mix in the related paper. The FireDB data for this MSS
show that the occupancy is high (9 of 11 chains) but there are no homologous sites (E = 0). (C) Clicking on the 'E = 8' icon in (A) will lead the user
to the alignment of evolutionarily related sites for the levan fructotransferase-binding site (4FFH). Residues in the binding site alignment are
coloured: the darker the blue, the more conserved the position. Much of the site is conserved even in distant homologous sites and the nature
of the bound ligands (Ligand Compounds) suggests that this is a biologically important sugar-binding site.

This annotation effort makes FireDB a more complete 
database, filling a need for the annotation of
pharmacological information on PDB ligands, and also offering
users the possibility of exploring the collapsed MSS 
from a pharmacological point of view. This effort is espe- 
cially important because, apart from the direct informa- 
tion itself, it also allows homology-based function 
prediction methods such as firestar to make predictions 
for drug-binding sites. 

Evolutionarily related sites 

At the binding site level FireDB now includes
evolutionary analyses of binding site residues. The evolutionary
information used in the biological relevance analysis comes
from running the functional residue prediction server firestar
in an all-against-all mode for each FireDB master
sequence. The firestar searches allow FireDB to cluster
together related MSS. Multiple alignments of homologous
MSS are generated when the detected sites overlap for
40% of the annotated residues. The alignments form the
basis of the calculation of the biological relevance of
individual binding sites in FireDB and this information is an
important aid in the prediction of functional residues by the
firestar server.

In Figure 1 we show the information retrieved from 
FireDB for two sucrose-binding proteins (4FFH and 
4DWR). For the biological binding site (from 4FFH) 
the occupancy is only 12%, but there are evolutionarily 
related binding sites for the sucrose. Indeed there is a core 
of well-conserved residues that bind sugars in the evolu- 
tionarily related binding sites even when they are remote 
homologues. The evolutionarily related binding sites 
confirm the biological role of sucrose in 4FFH. 

For the non-biological ligand (from 4DWR) the occupancy
is 81%. However, there are no evolutionarily related 
binding sites for the sucrose in 4DWR. Although the 
sucrose in 4DWR is non-biological the high occupancy 
of the site in homologous proteins suggests that this site 
may have some ligand-binding role. It should also be 



Nucleic Acids Research, 2014, Vol. 42, Database issue D271 



[Figure 2 image: screen capture of the FireDB ligand page for APW (Non Cognate, Non Metal), showing the Summary, External Refs and BINDING_SITES tabs, alongside keyword search results listing APW and MF4.]

Figure 2. A screen capture from the new ligand information web pages in FireDB. The user can directly query the database using the PDB ligand
three-letter code or can search using a keyword. Searches with keywords generate a window with the result of the search (right). Information is 
shown in pull-down tabs. General information is shown in the summary tab, and an additional three tabs are generated when information is 
available. The external references tab appears when a match with an external database has been found; a manual references tab is generated if 
manual annotation has been collected from the literature. 



pointed out that due to the heterogeneous distribution of 
the structures deposited in the PDB, the absence of evo- 
lutionarily related binding sites does not automatically 
imply that a binding site is not biological. 

We have automatically evaluated the biological rele- 
vance of all MSS in the current release. We used 
SQUARE to identify conserved residues and motifs in 
the firestar-generated multiple alignments for each MSS.
MSS that bound cognate ligands and that had at least a
core of conserved ligand-binding residues were considered 
biologically relevant. Specific amino acid composition 
filters were used for MSS that bound metal ligands. 

Out of 116 514 ligand-binding MSS in the current 
release, 64 896 have at least one homologue and of these 
10 320 non-metal and 6976 metal MSS were tagged as 
biologically relevant. Beyond this we were able to tag 
another 1393 MSS as 'novel' biologically relevant sites.
These MSS were those that did not have homologous
MSS, but where all other features (cognate ligand,
residue composition) pointed to their biological activity. 
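The tagging logic described in the last two paragraphs can be summarised as a small decision function. This is a simplified sketch, not the FireDB implementation: the conserved-core test (done with SQUARE) and the amino acid composition filters are reduced to boolean flags, and the class, field and tag names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SiteSummary:
    """Hypothetical per-MSS summary used only for this illustration."""
    has_cognate_ligand: bool
    n_homologues: int            # evolutionarily related MSS found by firestar
    has_conserved_core: bool     # stands in for the SQUARE conservation analysis
    composition_ok: bool = True  # stands in for the residue composition filters


def relevance_tag(site):
    """Assign 'biologically relevant', 'novel' or 'untagged' to one MSS."""
    if not site.has_cognate_ligand or not site.composition_ok:
        return "untagged"
    if site.n_homologues > 0:
        # Homologues exist: require a conserved core of binding residues.
        return "biologically relevant" if site.has_conserved_core else "untagged"
    # No homologues, but the ligand and composition point to biological activity.
    return "novel"


tags = [
    relevance_tag(SiteSummary(True, 8, True)),
    relevance_tag(SiteSummary(True, 0, False)),
    relevance_tag(SiteSummary(False, 5, True)),
]
```

In this simplified form a cognate site with eight conserved homologues is tagged as biologically relevant, a cognate site without homologues as novel, and a site with no cognate ligand is left untagged.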

These biologically relevant and novel MSS combined 
with the catalytic site MSS mean that FireDB contains a 
total of approximately 30 000 biologically relevant MSS.
The entire set of biologically relevant ligand-binding MSS 
can be downloaded from FireDB. Further information on 
the decision-making process involved in determining bio- 
logical relevance can be found on the web pages. 
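The approximate total can be reproduced by summing the counts given in the text, assuming the four categories do not overlap; the variable names are illustrative.

```python
RELEVANT_NON_METAL = 10320   # non-metal MSS tagged as biologically relevant
RELEVANT_METAL = 6976        # metal MSS tagged as biologically relevant
NOVEL = 1393                 # 'novel' MSS without homologous sites
CATALYTIC = 11416            # catalytic site MSS

total_relevant = RELEVANT_NON_METAL + RELEVANT_METAL + NOVEL + CATALYTIC
print(total_relevant)
```

The sum is 30 105, in line with the figure of approximately 30 000 quoted above.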

RESTful web services

FireDB is freely available via the web. The database is 
available as a MySQL dump, and we have also developed 
RESTful web services to make the resource easier and
faster to access for the scientific community. All the
annotations are easily retrievable. An example script is available
at http://firedb.bioinfo.cnio.es/rest/FireDB_rest.pl 



ROOM FOR FUTURE IMPROVEMENTS 

Like all inventories that base annotations on experimental
data, the content and quality of FireDB could be improved
by adding other reliable sources of annotated functional
residues beyond those in the PDB. We would like to inte- 
grate other sources of experimentally annotated function- 
ally important residues where they exist. 

At present FireDB only contains protein-small 
molecule ligand interactions. We are looking into ways
of including protein-protein and protein-DNA inter- 
actions and also information from post-translational 
modifications and mutations to extend the coverage of 
FireDB. 

We will continue to add to the manual curation of 
ligands and in particular extend the literature links
where possible. 